About the project


First exercise

In the first exercise we installed the required programs and connected to the online environment. I had trouble getting RStudio and Git to talk to each other, so I had to redo the installations several times. After these problems I got to practice in the DataCamp environment, but for some reason the platform renders noticeably smaller than normal, and I have not solved this yet.

R exercises

The R exercises covered very basic R usage, but near the end they dove into functions very quickly. In my opinion a few more video lectures on the topic would have been useful. In any case, a very interesting concept. I have not used Git or Markdown before, so it is good to get experience with these as well.

My Git


Assignment 2


Second exercise

Importing the data

data <- read.csv("learing_2019.csv")

#Summary
summary(data)
##        X          gender       Age           Attitude         Points     
##  Min.   :  1.00   F:110   Min.   :17.00   Min.   :14.00   Min.   : 7.00  
##  1st Qu.: 42.25   M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00  
##  Median : 83.50           Median :22.00   Median :32.00   Median :23.00  
##  Mean   : 83.50           Mean   :25.51   Mean   :31.43   Mean   :22.72  
##  3rd Qu.:124.75           3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75  
##  Max.   :166.00           Max.   :55.00   Max.   :50.00   Max.   :33.00  
##       Deep            Stra            Surf      
##  Min.   :1.583   Min.   :1.250   Min.   :1.583  
##  1st Qu.:3.333   1st Qu.:2.625   1st Qu.:2.417  
##  Median :3.667   Median :3.188   Median :2.833  
##  Mean   :3.680   Mean   :3.121   Mean   :2.787  
##  3rd Qu.:4.083   3rd Qu.:3.625   3rd Qu.:3.167  
##  Max.   :4.917   Max.   :5.000   Max.   :4.333
#Access ggplot and GGally
library(ggplot2)

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
#Creating graphics
p1 <- ggplot(data, aes(x = Attitude, y = Points, col = gender))

#and fitting linear model
p2 <- p1 + geom_point() + geom_smooth(method = "lm")
p2

In these graphs we can see that gender plays a role. The biggest difference between the genders appears in the attitude variable, where females have a clearly lower mean. The distributions of the attitude and surf variables also differ between males and females.

#Correlations

p <- ggpairs(data, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

## Regression model with multiple explanatory variables
my_model <- lm(Points ~ Attitude + Stra + Surf, data = data)

# Summary of the model
summary(my_model)
## 
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## Stra         0.85313    0.54159   1.575  0.11716    
## Surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Only Attitude appears significant in this fitted model. The multiple R-squared is about 0.2, so approximately 20% of the variation in Points is explained by the model, and since Attitude is the only significant predictor, essentially by the variation in Attitude.
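To make the R-squared interpretation concrete, here is a small sketch checking that the multiple R-squared printed by summary() equals 1 - RSS/TSS. It uses R's built-in mtcars data, since the course data file is not bundled with R.

```r
# Multiple R-squared by hand: 1 - RSS/TSS, i.e. the share of the
# response variation explained by the fitted model.
model <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(model)^2)                  # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r2_by_hand <- 1 - rss / tss

all.equal(r2_by_hand, summary(model)$r.squared) # TRUE
```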

# drawing diagnostic plots using the plot() function: Residuals vs Fitted values, Normal Q-Q plot and Residuals vs Leverage

par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))

The assumptions of the model are that the error terms are approximately normally distributed with mean 0 and constant variance, and that they are uncorrelated and independent of the explanatory variables. In particular, the size of the error should not depend on the values of the explanatory variables.

From these plots the model assumptions appear approximately valid, with the small exception that small and large values of Points show somewhat larger deviations from the estimated mean. A couple of observations also have fairly large leverage, but overall the assumptions seem to hold quite well. The ends of the range remain questionable, and additional analysis would be needed there.


Assignment 3

Exercise 3

Mette Nissen, 15.9.2019. Exercise 3: analysis of student alcohol consumption.


Data Wrangling

alc <- read.csv("alc.csv", header = TRUE, sep = ",")


library(ggplot2)
library(GGally)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────── tidyverse 1.3.0 ──
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ✔ purrr   0.3.3
## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(knitr)

Relationship between variables and high use

I took the variables sex, number of school absences, going out with friends, and final grade into the analysis. The hypothesis is that students performing poorly in school and missing classes are at a higher risk of using more alcohol.

Summary of variables

From the summary we can see that the mean age is about 16.6 years. The bar chart shows that high use is more common among men. The next boxplot shows that, in both genders, students with high alcohol use go out more with friends. Final grades also seem to be lower for high users. Plotting absences against age, the last figure shows that absences seem to be higher among older men with high alcohol consumption.

The same patterns appear in the group means. The data has 198 female and 184 male students. Among females, 156 are not high users and 42 are; among males the corresponding numbers are 112 and 72, so a higher proportion of men are high users. High use does not affect the final grade in females, but there is a difference among male students: 12.2 in the non-high-use group versus 10.3 in the high-use group. Absences are higher for high users of both genders, and the same holds for going out with friends (see the mean_absences and mean_goout tables).

#Summary of the datatable

kable(summary(alc), digits = 2)
(The kable summary table of all 35 variables is too wide to reproduce legibly here. Key figures: 382 students, 198 female and 184 male, ages 15 to 22 with mean 16.6, mean absences 4.5, and high_use is TRUE for 114 students and FALSE for 268.)
# Graphs

p <- ggpairs(alc, columns = c("sex", "absences","goout", "G3", "high_use"), mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

# initialize a barchart of alcohol use difference between genders
g1 <- ggplot(data = alc, aes(x=sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Student alcohol consumption by sex")
g1

# initialize a plot of alcohol use and going out with friends
g2 <- ggplot() + geom_boxplot(data = alc, aes(x = sex, y = goout)) + facet_wrap("high_use") + ggtitle("Student going out with friends by alcohol consumption and sex")
g2

# Boxplot of alcohol use and school final grade
g3 <- ggplot() + geom_boxplot(data = alc, aes(x = sex, y = G3)) + facet_wrap("high_use") + ggtitle("Student final grades by alcohol consumption and sex")
g3

# Scatterplot showing linear model between age and absences separated by alcohol consumption
g4 <- ggplot(data = alc, aes(x = age, y = absences, color=sex, alpha = 0.3)) + geom_point() + facet_wrap("high_use")+ geom_jitter() + geom_smooth(method = "lm") + ggtitle("Student absences by alcohol consumption, sex and age")
g4

alc %>% group_by(sex) %>% summarise(count = n())
## # A tibble: 2 x 2
##   sex   count
##   <fct> <int>
## 1 F       198
## 2 M       184
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_grade
##   <fct> <lgl>    <int>      <dbl>
## 1 F     FALSE      156       11.4
## 2 F     TRUE        42       11.7
## 3 M     FALSE      112       12.2
## 4 M     TRUE        72       10.3
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_absences
##   <fct> <lgl>    <int>         <dbl>
## 1 F     FALSE      156          4.22
## 2 F     TRUE        42          6.79
## 3 M     FALSE      112          2.98
## 4 M     TRUE        72          6.12
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_goout
##   <fct> <lgl>    <int>      <dbl>
## 1 F     FALSE      156       2.96
## 2 F     TRUE        42       3.36
## 3 M     FALSE      112       2.71
## 4 M     TRUE        72       3.93

Logistic regression model

# glm() model
m <- glm(high_use ~ absences + sex + goout, data = alc, family = "binomial")

# the summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + sex + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7871  -0.8153  -0.5446   0.8357   2.4740  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.16317    0.47506  -8.764  < 2e-16 ***
## absences     0.08418    0.02237   3.764 0.000167 ***
## sexM         0.95872    0.25459   3.766 0.000166 ***
## goout        0.72981    0.11970   6.097 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 387.75  on 378  degrees of freedom
## AIC: 395.75
## 
## Number of Fisher Scoring iterations: 4
# the coefficients of the model
coef(m)
## (Intercept)    absences        sexM       goout 
## -4.16317087  0.08418399  0.95871896  0.72980859
# the odds ratio (OR)
OR <- coef(m) %>% exp

# the confidence interval (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
OR
## (Intercept)    absences        sexM       goout 
##  0.01555815  1.08782902  2.60835292  2.07468346
CI
##                   2.5 %     97.5 %
## (Intercept) 0.005885392 0.03804621
## absences    1.042458467 1.13933894
## sexM        1.593132148 4.33151387
## goout       1.650182481 2.64111050
# printing out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR       2.5 %     97.5 %
## (Intercept) 0.01555815 0.005885392 0.03804621
## absences    1.08782902 1.042458467 1.13933894
## sexM        2.60835292 1.593132148 4.33151387
## goout       2.07468346 1.650182481 2.64111050

Analysis of GLM

In this model all three explanatory variables are statistically highly significant. Computing the odds ratios and confidence intervals, absences, male sex, and going out are all positively associated with high use, with confidence intervals staying above 1, indicating a significant association. When G3 was included in an earlier fit, its association was negative and not significant, so I leave G3 out of the model used for the predictions.
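As a sketch of how the printed coefficients map to the odds ratios above: exponentiating a logistic regression coefficient gives the multiplicative change in the odds per one-unit increase of the predictor.

```r
# Coefficient of 'goout' from the summary above (log-odds scale)
b_goout <- 0.72980859

# Odds ratio: each additional point of 'goout' multiplies the
# odds of high alcohol use by about 2.07
or_goout <- exp(b_goout)
round(or_goout, 2)  # 2.07
```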

Prediction with the model

# probability of high_use
probabilities <- predict(m, type = "response")

# adding the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# using the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = (probability > 0.5))

# the last ten original classes, predicted probabilities, and class predictions
select(alc, goout, absences, sex, high_use, probability, prediction) %>% tail(10)
##     goout absences sex high_use probability prediction
## 373     2        0   M    FALSE  0.14869987      FALSE
## 374     3        7   M     TRUE  0.39514446      FALSE
## 375     3        1   F    FALSE  0.13129452      FALSE
## 376     3        6   F    FALSE  0.18714923      FALSE
## 377     2        2   F    FALSE  0.07342805      FALSE
## 378     4        2   F    FALSE  0.25434555      FALSE
## 379     2        2   F    FALSE  0.07342805      FALSE
## 380     1        3   F    FALSE  0.03989428      FALSE
## 381     5        4   M     TRUE  0.68596604       TRUE
## 382     1        2   M     TRUE  0.09060457      FALSE
# target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   253   15
##    TRUE     65   49

Among the last ten cases the true values of high use are 3 TRUE and 7 FALSE, whereas the predictions give 1 TRUE and 9 FALSE.
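From the full cross-tabulation above, the training accuracy can be computed directly. A sketch with the printed counts plugged in:

```r
# 2x2 confusion matrix copied from the table above
tab <- matrix(c(253, 65, 15, 49), nrow = 2,
              dimnames = list(high_use   = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))

# accuracy = correctly classified cases / all cases
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 3)  # 0.791
```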

Model predictions in graphics

# a plot of 'high_use' versus 'probability' in 'alc'
pr <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
pr2 <- pr + geom_point() + ggtitle("Predictions")
pr2

Cross validation

Here the prediction error of the model is calculated on the training data:

# the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66230366 0.03926702 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.83246073 0.16753927 1.00000000
# a loss function 
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2094241

The proportion of wrong predictions in the training data with this model, about 21%, is smaller than the 26% obtained in the DataCamp exercise.
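The loss above is computed on the training data; a proper cross-validation estimate can be obtained with boot::cv.glm. A self-contained sketch follows, with simulated data standing in for alc (which is not bundled with R):

```r
library(boot)  # ships with R; provides cv.glm()

# Simulated stand-in for the 'alc' data
set.seed(123)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))

fit <- glm(y ~ x, data = d, family = "binomial")

# Same loss function as above: proportion of wrong predictions
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validation; delta[1] is the average loss over the folds
cv <- cv.glm(data = d, cost = loss_func, glmfit = fit, K = 10)
cv$delta[1]
```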


Assignment 4

Exercise 4

# access packages

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)
library(GGally)
library(tidyverse)
library(dplyr)
library(knitr)
library(corrplot)
## corrplot 0.84 loaded
library(tidyr)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# load the data
data("Boston")

Summary of the Boston Data

This data frame contains the following columns:
crim = per capita crime rate by town.
zn = proportion of residential land zoned for lots over 25,000 sq.ft.
indus = proportion of non-retail business acres per town.
chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox = nitrogen oxides concentration (parts per 10 million).
rm = average number of rooms per dwelling.
age = proportion of owner-occupied units built prior to 1940.
dis = weighted mean of distances to five Boston employment centres.
rad = index of accessibility to radial highways.
tax = full-value property-tax rate per $10,000.
ptratio = pupil-teacher ratio by town.
black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat = lower status of the population (percent).
medv = median value of owner-occupied homes in $1000s.

In this week's exercise we use the Boston data set from the R package MASS, historical data collected from 506 districts in the Boston area.

Boston has 14 variables and 506 observations. The crime rate (crim) is the response variable.

The variables and their explanations are shown above.

#Dataset summary and variables 

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
pairs(Boston)

head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
#Graphical summary of crime variable

ggplot(Boston, aes(crim)) + stat_density() + theme_bw()

#Plotting each variable against crime rate

bosmelt <- melt(Boston, id="crim")
ggplot(bosmelt, aes(x=value, y=crim))+
  facet_wrap(~variable, scales="free")+
  geom_point()

Graphical overview and summaries of variables

boxplot(Boston$crim, Boston$zn, Boston$indus, Boston$chas, Boston$nox, Boston$rm, Boston$age, Boston$dis, Boston$rad, Boston$tax, Boston$ptratio, Boston$black, Boston$lstat, Boston$medv, names = c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv"))

Fitting a multiple linear regression model with all variables

mlm <- lm(formula = crim ~ ., data = Boston)
summary(mlm)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.924 -2.120 -0.353  1.019 75.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.033228   7.234903   2.354 0.018949 *  
## zn            0.044855   0.018734   2.394 0.017025 *  
## indus        -0.063855   0.083407  -0.766 0.444294    
## chas         -0.749134   1.180147  -0.635 0.525867    
## nox         -10.313535   5.275536  -1.955 0.051152 .  
## rm            0.430131   0.612830   0.702 0.483089    
## age           0.001452   0.017925   0.081 0.935488    
## dis          -0.987176   0.281817  -3.503 0.000502 ***
## rad           0.588209   0.088049   6.680 6.46e-11 ***
## tax          -0.003780   0.005156  -0.733 0.463793    
## ptratio      -0.271081   0.186450  -1.454 0.146611    
## black        -0.007538   0.003673  -2.052 0.040702 *  
## lstat         0.126211   0.075725   1.667 0.096208 .  
## medv         -0.198887   0.060516  -3.287 0.001087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396 
## F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16

Multiple linear regression model interpretation

The most significant variables in the model are dis and rad (highly significant), medv (moderately significant), and zn and black (weaker, but still below p = 0.05).

Including Correlations

cor_matrix<-cor(Boston) 
cor_matrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
corrplot.mixed(cor_matrix, number.cex = .6)

The corrplot shows the relationships between the variables. The highest positive correlations are between rad and tax, indus and nox, and age and nox. The highest negative correlations are between dis and nox, age and dis, and lstat and medv. We can also see from the summaries that the scales of the variables are very different, so we need to standardize the data set before doing linear discriminant analysis.

Standardizing the dataset

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)


Scaling effects

Standardizing centers the data: each continuous variable is put on unit scale by subtracting its mean and dividing the result by its standard deviation. After this, each variable has mean 0 and standard deviation 1.
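A minimal check that scale() does exactly this:

```r
x <- c(2, 4, 6, 8)

# scale() works column-wise: subtract the mean, divide by the sd
by_hand <- (x - mean(x)) / sd(x)

all.equal(as.vector(scale(x)), by_hand)  # TRUE
# and the standardized values have mean 0 and sd 1
c(mean(by_hand), sd(by_hand))
```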

Creating categorical variable of the crime rate

# creating a quantile vector of crim 
bins <- quantile(boston_scaled$crim)

crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)

table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
#removing crim
boston_scaled <- dplyr::select(boston_scaled, -crim)

#adding categorical variable to the table
boston_scaled <- data.frame(boston_scaled, crime)

Creating training and test sets

To make predictions with the data we need a training set for fitting the model, here chosen to be 80% of the cases; the remaining data is used as a test set to assess the accuracy of the model.
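A sketch of the split with a seed added so the result is reproducible across knits (the seed value and the n of 100 are arbitrary stand-ins; in the chunk below n is nrow(boston_scaled)):

```r
set.seed(2019)   # arbitrary seed for reproducibility
n <- 100         # stands in for nrow(boston_scaled)

# row indices for the training set; the rest form the test set
ind <- sample(n, size = n * 0.8)

length(ind)                # 80 rows in the training set
length(setdiff(1:n, ind))  # 20 rows left for the test set
```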

n <- nrow(boston_scaled)

#Choosing 80% to the training set
ind <- sample(n,  size = n * 0.8)

train <- boston_scaled[ind,]

# creating the test set 
test <- boston_scaled[-ind,]

Fitting linear discriminant analysis on the training set using crime rate as the target variable

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2351485 0.2648515 0.2599010 0.2400990 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       0.99382637 -0.9425846 -0.06511327 -0.9138794  0.45964590 -0.8838782
## med_low  -0.09012577 -0.3210174 -0.05155709 -0.5736368 -0.10200686 -0.3442029
## med_high -0.39658567  0.2242815  0.14012905  0.4296007  0.07222506  0.5034917
## high     -0.48724019  1.0149946 -0.02879709  1.0600922 -0.35441183  0.8337612
##                 dis        rad        tax    ptratio      black       lstat
## low       0.9579618 -0.7038213 -0.7287184 -0.4880429  0.3751934 -0.77868566
## med_low   0.3602152 -0.5439511 -0.4654886 -0.1055134  0.3189766 -0.15801437
## med_high -0.4122934 -0.4185755 -0.2935570 -0.2574838  0.1651912  0.06980203
## high     -0.8543703  1.6596029  1.5294129  0.8057784 -0.8461899  0.95182486
##                medv
## low       0.5095235
## med_low   0.0203129
## med_high  0.1405777
## high     -0.7993128
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.122170990  0.586964270 -1.14083441
## indus    0.005181891 -0.341794759  0.22040463
## chas    -0.075824154  0.051400685  0.06966214
## nox      0.237745167 -0.790955322 -1.34198690
## rm      -0.084634265 -0.014945171 -0.11799973
## age      0.332732763 -0.397683317 -0.27579868
## dis     -0.114669377 -0.179384614  0.02969404
## rad      3.385281510  0.712762323 -0.05839673
## tax      0.011554485  0.326727853  0.58349973
## ptratio  0.128303202  0.003442482 -0.41136016
## black   -0.147418275 -0.015178506  0.12492664
## lstat    0.174540976 -0.187036271  0.31762396
## medv     0.135728495 -0.444657212 -0.26665753
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9450 0.0423 0.0127

LDA biplot

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results and draw the arrows on top
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Prediction based on the lda model

# saving the correct classes from test data
correct_classes <- test$crime

# removing the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15      12        5    0
##   med_low    2      14        3    0
##   med_high   2       8       10    1
##   high       0       0        1   29

From the cross table we can see that the high class is predicted very well, but more errors occur in the lower classes.
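
The cross table can be condensed into one overall accuracy figure. A small sketch, with the matrix copied from the printout above (the exact counts vary between knits because the train/test split is random):

```r
# confusion matrix copied from the cross tabulation above
conf <- matrix(c(15, 12,  5,  0,
                  2, 14,  3,  0,
                  2,  8, 10,  1,
                  0,  0,  1, 29),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# correctly classified test cases sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)   # 0.667
```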

Clustering: Euclidean distances

# Boston dataset reading and standardization again

data("Boston")
b_boston_scaled <- scale(Boston)

# Distances with euclidean distance
dist_eu <- dist(b_boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Clustering with K means with 3 cluster centers

# note: km is computed before the seed is set, so this run is not reproducible
km <- kmeans(b_boston_scaled, centers = 3)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(b_boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

The optimal number of clusters is the point where the line drops sharply, the so-called elbow. Here it seems to be two.
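
The same elbow logic can be demonstrated on synthetic data (an illustration, not the Boston data): with two clearly separated clusters the total WCSS drops sharply from k = 1 to k = 2 and only a little after that.

```r
set.seed(42)
# two well-separated 2D clusters of 50 points each
x <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))

# total within-cluster sum of squares for k = 1..5
twcss <- sapply(1:5, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# the drop from 1 to 2 clusters dwarfs the drop from 2 to 3
(twcss[1] - twcss[2]) / (twcss[2] - twcss[3])
```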

# Clustering again
km2 <- kmeans(b_boston_scaled, centers = 2)
pairs(b_boston_scaled[,1:7], col = km2$cluster)

pairs(b_boston_scaled[,8:14], col = km2$cluster)

Bonus

# note: km3 is computed before the seed is set, so this run is not reproducible
km3 <- kmeans(Boston, centers = 3)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

km4 <- kmeans(Boston, centers = 2)
pairs(Boston[,1:7], col = km4$cluster)

pairs(Boston[,8:14], col = km4$cluster)

bins <- quantile(Boston$crim)

crime2 <- cut(Boston$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)

table(crime2)
## crime2
##      low  med_low med_high     high 
##      127      126      126      127
Boston <- dplyr::select(Boston, -crim)

Boston <- data.frame(Boston, crime2)

u <- nrow(Boston)

ind2 <- sample(u,  size = u * 0.8)

train2 <- Boston[ind2,]

test2 <- Boston[-ind2,]

lda.fit2 <- lda(crime2 ~ ., data = train2)

lda.fit2
## Call:
## lda(crime2 ~ ., data = train2)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2500000 0.2351485 0.2623762 
## 
## Group means:
##                 zn     indus       chas       nox       rm      age      dis
## low      33.348039  5.173333 0.03921569 0.4534382 6.584784 43.77549 5.657954
## med_low   8.094059  9.092871 0.04950495 0.4921772 6.197436 59.68119 4.467285
## med_high  2.547368 12.335158 0.15789474 0.6040526 6.402337 81.93684 2.911674
## high      0.000000 18.100000 0.05660377 0.6798208 6.005179 91.18208 2.027506
##                rad      tax  ptratio    black     lstat     medv
## low       3.490196 282.8333 17.54412 390.9058  7.175784 27.00098
## med_low   4.742574 321.8812 18.31683 385.1081 11.669109 22.51485
## med_high  5.736842 349.4211 17.51789 364.4000 12.678211 24.52947
## high     24.000000 666.0000 20.20000 278.7887 18.729623 15.95849
## 
## Coefficients of linear discriminants:
##                   LD1           LD2           LD3
## zn       0.0059594595  0.0291392963  -0.043654152
## indus    0.0116837992 -0.0194328871   0.028512310
## chas    -0.4398527204 -0.4347835204  -0.196120764
## nox      1.9916327959 -5.6218782002 -10.346653879
## rm      -0.1892416869 -0.1153374711  -0.261021540
## age      0.0080682351 -0.0142044048  -0.007142634
## dis     -0.0553948278 -0.0682297733   0.054422317
## rad      0.4673782227  0.1037951245  -0.004063668
## tax      0.0004496804 -0.0003602709   0.002638328
## ptratio  0.0678345162  0.0524262621  -0.082236100
## black   -0.0015035340  0.0003945387   0.001896470
## lstat    0.0294406326 -0.0448026121   0.058101833
## medv     0.0234646334 -0.0462138320  -0.009722294
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9645 0.0275 0.0081
lda.arrows2 <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes2 <- as.numeric(train2$crime2)

# plot the lda results and draw the arrows on top
plot(lda.fit2, dimen = 2, col = classes2, pch = classes2)
lda.arrows2(lda.fit2, myscale = 2)
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped

Nox seems to be the most influential linear separator in the analysis without standardization.

Super Bonus

I could not get the colors right, but otherwise the plot turned out nicely.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

Chapter 5. Dimensionality reduction techniques

This week we are using the human dataset from the United Nations Development Programme. The country names are stored as row names and the following eight variables have been selected:

Edu2.FM - ratio of females to males with at least secondary education
Labo.FM - ratio of female to male labour force participation rates
Edu.Exp - expected years of education
Life.Exp - life expectancy at birth
GNI - gross national income per capita
Mat.Mor - maternal mortality ratio
Ado.Birth - adolescent birth rate
Parli.F - percentage of female representatives in parliament

Data exploration

human <- read.csv("data/human.csv", row.names = 1)

library(GGally)
library(tidyverse)   # loads dplyr, tidyr and ggplot2 among others
library(corrplot)
library(knitr)
library(reshape2)
library(plotly)
library(FactoMineR)


str(human)
## 'data.frame':    148 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1980   Min.   :0.1857   Min.   : 7.00   Min.   :49.00  
##  1st Qu.:0.8097   1st Qu.:0.5971   1st Qu.:11.57   1st Qu.:68.38  
##  Median :0.9444   Median :0.7504   Median :13.55   Median :74.35  
##  Mean   :0.8766   Mean   :0.7006   Mean   :13.42   Mean   :72.44  
##  3rd Qu.:0.9990   3rd Qu.:0.8385   3rd Qu.:15.32   3rd Qu.:77.45  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor        Ado.Birth         Parli.F     
##  Min.   :   680   Min.   :  1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  5190   1st Qu.: 11.0   1st Qu.: 12.18   1st Qu.:12.15  
##  Median : 12408   Median : 45.5   Median : 31.30   Median :19.50  
##  Mean   : 18402   Mean   :120.9   Mean   : 43.72   Mean   :20.95  
##  3rd Qu.: 25350   3rd Qu.:170.0   3rd Qu.: 69.05   3rd Qu.:27.82  
##  Max.   :123124   Max.   :730.0   Max.   :175.60   Max.   :57.50
# scatterplot matrix of all variables (note: the name masks base R's pairs())
pairs <- ggpairs(human)
pairs

# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()

From the summary we can see that the distributions differ greatly between variables: GNI ranges from 680 to 123124, while Labo.FM spans only 0.1857 to 1.0380, so the variables are on very different scales. Correlations between variables range from strongly positive (maternal mortality and the adolescent birth rate) to strongly negative (maternal mortality and life expectancy).

Principal component analysis (PCA) on the unstandardized human data

#PCA analysis
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.862483e+04 1.420218e+02 2.116337e+01 1.147634e+01 3.608478e+00
## [6] 1.571358e+00 1.923557e-01 1.512244e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4           PC5
## Edu2.FM   -4.649714e-06  0.0006867686 -0.0013662633  2.317587e-04 -0.0055155087
## Labo.FM   -9.719607e-08 -0.0003158778  0.0003015743  4.425582e-03  0.0052146757
## Edu.Exp   -8.706739e-05  0.0087580112  0.0053006118  3.255666e-02  0.1415081241
## Life.Exp  -2.533543e-04  0.0333850374 -0.0048336882  7.494569e-02  0.9861585472
## GNI       -9.999893e-01 -0.0046050756 -0.0003728181 -6.050839e-05 -0.0001023982
## Mat.Mor    4.477844e-03 -0.9863064242  0.1609608588  1.124047e-02  0.0337070459
## Ado.Birth  1.111169e-03 -0.1611663247 -0.9864335638 -3.008031e-02  0.0039566221
## Parli.F   -5.580863e-05  0.0034657989 -0.0314144646  9.961287e-01 -0.0791032655
##                     PC6           PC7           PC8
## Edu2.FM    1.762118e-02  6.384062e-01  7.694765e-01
## Labo.FM    3.217052e-02  7.688482e-01 -6.385848e-01
## Edu.Exp    9.886674e-01 -3.580796e-02  8.073940e-03
## Life.Exp  -1.437809e-01  4.420684e-03  6.632638e-03
## GNI       -2.864828e-05 -1.086468e-06 -1.752803e-06
## Mat.Mor    2.675068e-03  1.432260e-04  1.224251e-03
## Ado.Birth  7.122449e-03 -7.522696e-04 -1.109192e-03
## Parli.F   -2.145731e-02 -2.750969e-03  1.847856e-03
#Summary of principal component analysis
s <- summary(pca_human)

# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)

# creating an object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# drawing a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA_non-scaled")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

PCA on unscaled data is not useful here: GNI has by far the largest standard deviation, so it dominates the analysis and the first component captures essentially 100% of the variance. Therefore we must scale the data to get a valid analysis.
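
This scale sensitivity is easy to demonstrate with made-up data (an illustration, not the human data): one variable with a much larger scale captures nearly all of the variance in unscaled PCA.

```r
set.seed(1)
df <- data.frame(big   = rnorm(100, sd = 1000),  # an income-like variable
                 small = rnorm(100, sd = 1))

pc_raw <- prcomp(df)          # PCA on the raw data
pc_std <- prcomp(scale(df))   # PCA after standardization

# proportion of total variance captured by the first component
prop_raw <- pc_raw$sdev[1]^2 / sum(pc_raw$sdev^2)
prop_std <- pc_std$sdev[1]^2 / sum(pc_std$sdev^2)
round(c(unscaled = prop_raw, scaled = prop_std), 3)
```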

PCA on standardized data

# standardize the variables
human_std <- scale(human)
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp            Life.Exp      
##  Min.   :-3.1089   Min.   :-2.6159   Min.   :-2.42684   Min.   :-3.0723  
##  1st Qu.:-0.3065   1st Qu.:-0.5256   1st Qu.:-0.69805   1st Qu.:-0.5329  
##  Median : 0.3108   Median : 0.2531   Median : 0.04826   Median : 0.2503  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.5610   3rd Qu.: 0.7005   3rd Qu.: 0.71899   3rd Qu.: 0.6566  
##  Max.   : 2.8414   Max.   : 1.7144   Max.   : 2.56114   Max.   : 1.4495  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9515   Min.   :-0.7355   Min.   :-1.1573   Min.   :-1.8197  
##  1st Qu.:-0.7094   1st Qu.:-0.6742   1st Qu.:-0.8466   1st Qu.:-0.7643  
##  Median :-0.3218   Median :-0.4626   Median :-0.3333   Median :-0.1258  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3730   3rd Qu.: 0.3009   3rd Qu.: 0.6799   3rd Qu.: 0.5973  
##  Max.   : 5.6228   Max.   : 3.7352   Max.   : 3.5397   Max.   : 3.1749
#PCA analysis with scaled variables
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 2.0333042 1.1429035 0.8966375 0.7974445 0.7016064 0.5533825 0.4782085
## [8] 0.3039769
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.32465414  0.09227181 -0.43377536  0.71962839 -0.3646631
## Labo.FM    0.02176326  0.72972518 -0.51145278 -0.19826256  0.3349423
## Edu.Exp   -0.42571462  0.14257522 -0.03704407 -0.14368290  0.1144547
## Life.Exp  -0.44696256 -0.02108739  0.13444430 -0.07252202  0.1544074
## GNI       -0.36191020  0.03284391 -0.10502738 -0.55841503 -0.6927543
## Mat.Mor    0.44847613  0.16196165 -0.07472686 -0.20980104 -0.2594996
## Ado.Birth  0.41583795  0.14603035 -0.06162895  0.13254030 -0.3781145
## Parli.F   -0.08992529  0.62416308  0.71441898  0.20859522 -0.1663542
##                    PC6         PC7         PC8
## Edu2.FM   -0.102011694  0.06004801  0.18184654
## Labo.FM   -0.113322727 -0.17601955 -0.10061935
## Edu.Exp    0.611088764  0.62425612  0.01404528
## Life.Exp   0.296297409 -0.63847349  0.50711229
## GNI       -0.176468208 -0.08563262 -0.16340648
## Mat.Mor    0.004512949  0.24190007  0.77276191
## Ado.Birth  0.683004537 -0.31602881 -0.27395047
## Parli.F   -0.133691243  0.04841963 -0.02315017
# rounded percentages of variance captured by each PC
s_st <- summary(pca_human_std)
pca_pr_st <- round(100*s_st$importance[2,], digits = 1)

# creating an object pc_lab to be used as axis labels
pc_lab_st <- paste0(names(pca_pr_st), " (", pca_pr_st, "%)")

# drawing a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 0.8), col = c("grey40", "blue"), xlab = pc_lab_st[1], ylab = pc_lab_st[2], main = "PCA_scaled")

With scaling the analysis gives a much more reliable result. Rwanda seems to be an outlier in this two-dimensional biplot. PC1 and PC2 together account for 68% of the variance, which is quite a lot.

The arrows fall into groups with similar effects: the gender-participation variables (labour force and parliament) and the welfare variables (education, income, life expectancy and maternal health). Judging by the small angles within these groups, the variables in each group correlate quite strongly with each other.
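
The claim about angles can be made precise: for standardized variables the cosine of the angle between two biplot arrows approximates their correlation, and with only two variables (so the first two PCs span the whole space) the match is exact. A small check with made-up data:

```r
set.seed(2)
a <- rnorm(500)
b <- 0.6 * a + 0.8 * rnorm(500)   # correlation around 0.6
p <- prcomp(cbind(a = a, b = b), scale. = TRUE)

# biplot arrow coordinates: loadings scaled by the component standard deviations
arrows2d <- p$rotation %*% diag(p$sdev)
cosang <- sum(arrows2d["a", ] * arrows2d["b", ]) /
  sqrt(sum(arrows2d["a", ]^2) * sum(arrows2d["b", ]^2))

c(correlation = cor(a, b), cosine_of_angle = cosang)
```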

And now for something completely different: tea

data(tea)

# What is tea all about

dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 

The tea dataset includes 300 observations of 36 variables describing tea-drinking habits. Most of the variables are factors, many with only two levels. The full dataset is too large for a reasonable analysis, so I have picked the variables related to health attributes.

# column names to keep; note that "sport" is misspelled (the dataset's column
# is "Sport"), so it gets dropped with the warning below
keep_columns <- c("Tea", "sugar", "pub", "friends", "sport", "healthy", "effect.on.health", "slimming", "relaxing")

# selecting the 'keep_columns' to create a new dataset
tea_health <- select(tea, one_of(keep_columns))
## Warning: Unknown columns: `sport`
# look at the summaries and structure of the tea_health data
summary(tea_health)
##         Tea           sugar          pub             friends   
##  black    : 74   No.sugar:155   Not.pub:237   friends    :196  
##  Earl Grey:193   sugar   :145   pub    : 63   Not.friends:104  
##  green    : 33                                                 
##         healthy               effect.on.health        slimming  
##  healthy    :210   effect on health   : 66     No.slimming:255  
##  Not.healthy: 90   No.effect on health:234     slimming   : 45  
##                                                                 
##         relaxing  
##  No.relaxing:113  
##  relaxing   :187  
## 
str(tea_health)
## 'data.frame':    300 obs. of  8 variables:
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
gather(tea_health) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

pairs <- ggpairs(tea_health)
pairs

Most of the tea drinkers drink Earl Grey and drink it with friends. 210 respondents think tea is healthy, but only 66 think it has an effect on health. A majority (187) find tea relaxing. Sugar is used more often with Earl Grey.

# multiple correspondence analysis
mca <- MCA(tea_health, graph = FALSE)
# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_health, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.177   0.175   0.153   0.129   0.123   0.112   0.099
## % of var.             15.749  15.527  13.567  11.437  10.899   9.959   8.804
## Cumulative % of var.  15.749  31.275  44.842  56.279  67.178  77.137  85.940
##                        Dim.8   Dim.9
## Variance               0.087   0.071
## % of var.              7.731   6.329
## Cumulative % of var.  93.671 100.000
## 
## Individuals (the 10 first)
##                Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1           | -0.309  0.179  0.087 |  0.395  0.298  0.142 | -0.111  0.027
## 2           | -0.530  0.528  0.259 |  0.622  0.739  0.357 |  0.121  0.032
## 3           | -0.202  0.076  0.086 | -0.251  0.120  0.133 | -0.036  0.003
## 4           | -0.129  0.031  0.025 | -0.184  0.065  0.052 | -0.346  0.261
## 5           |  0.133  0.033  0.020 |  0.269  0.138  0.082 | -0.136  0.040
## 6           | -0.350  0.230  0.191 |  0.043  0.003  0.003 | -0.114  0.028
## 7           | -0.202  0.076  0.086 | -0.251  0.120  0.133 | -0.036  0.003
## 8           | -0.610  0.700  0.390 |  0.323  0.200  0.110 |  0.221  0.106
## 9           |  0.133  0.033  0.020 |  0.269  0.138  0.082 | -0.136  0.040
## 10          | -0.610  0.700  0.390 |  0.323  0.200  0.110 |  0.221  0.106
##               cos2  
## 1            0.011 |
## 2            0.014 |
## 3            0.003 |
## 4            0.181 |
## 5            0.021 |
## 6            0.020 |
## 7            0.003 |
## 8            0.051 |
## 9            0.021 |
## 10           0.051 |
## 
## Categories (the 10 first)
##                 Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test
## black       |  -0.512   4.559   0.086  -5.064 |   0.560   5.528   0.103   5.537
## Earl Grey   |   0.363   5.991   0.238   8.438 |  -0.379   6.599   0.259  -8.792
## green       |  -0.977   7.410   0.118  -5.940 |   0.959   7.243   0.114   5.831
## No.sugar    |  -0.360   4.719   0.138  -6.432 |   0.367   4.981   0.144   6.562
## sugar       |   0.385   5.044   0.138   6.432 |  -0.392   5.325   0.144  -6.562
## Not.pub     |  -0.041   0.092   0.006  -1.366 |   0.267   4.033   0.268   8.957
## pub         |   0.153   0.348   0.006   1.366 |  -1.005  15.170   0.268  -8.957
## friends     |   0.173   1.383   0.057   4.112 |  -0.340   5.416   0.218  -8.079
## Not.friends |  -0.326   2.607   0.057  -4.112 |   0.641  10.207   0.218   8.079
## healthy     |  -0.488  11.773   0.556 -12.896 |  -0.227   2.584   0.120  -5.999
##                 Dim.3     ctr    cos2  v.test  
## black       |   0.936  17.710   0.287   9.264 |
## Earl Grey   |  -0.108   0.613   0.021  -2.506 |
## green       |  -1.469  19.430   0.267  -8.928 |
## No.sugar    |   0.351   5.200   0.131   6.267 |
## sugar       |  -0.375   5.559   0.131  -6.267 |
## Not.pub     |  -0.092   0.546   0.032  -3.080 |
## pub         |   0.345   2.053   0.032   3.080 |
## friends     |   0.085   0.382   0.013   2.006 |
## Not.friends |  -0.159   0.720   0.013  -2.006 |
## healthy     |   0.021   0.025   0.001   0.549 |
## 
## Categorical variables (eta2)
##                    Dim.1 Dim.2 Dim.3  
## Tea              | 0.255 0.271 0.461 |
## sugar            | 0.138 0.144 0.131 |
## pub              | 0.006 0.268 0.032 |
## friends          | 0.057 0.218 0.013 |
## healthy          | 0.556 0.120 0.001 |
## effect.on.health | 0.345 0.131 0.181 |
## slimming         | 0.043 0.010 0.379 |
## relaxing         | 0.017 0.235 0.023 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

From this analysis we can see that 31.275% of the model's variance is explained by the first two dimensions. The social aspects seem to load more on dimension 2 and the physical-health aspects on dimension 1. Based on the minor clusters, people who drink tea with friends tend to use sugar and find tea relaxing, compared with people who drink tea without friends, who tend to drink black tea with no sugar.
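The explained-variance percentages quoted above can be read directly from the eigenvalue table of the MCA object. A minimal check, assuming `mca` is the FactoMineR::MCA result plotted above:

```r
# Eigenvalues and explained variance per dimension of the fitted MCA
head(mca$eig)

# Share of variance captured by the first two dimensions together
sum(mca$eig[1:2, "percentage of variance"])
```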


Chapter 6. Analysis of longitudinal data

library(dplyr)
library(tidyr)

###Load the data sets (BPRS and RATS)

BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", header = TRUE, sep = " ")
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')

###Inspecting the WIDE data

names(BPRS)
##  [1] "treatment" "subject"   "week0"     "week1"     "week2"     "week3"    
##  [7] "week4"     "week5"     "week6"     "week7"     "week8"
names(RATS)
##  [1] "ID"    "Group" "WD1"   "WD8"   "WD15"  "WD22"  "WD29"  "WD36"  "WD43" 
## [10] "WD44"  "WD50"  "WD57"  "WD64"
summary(BPRS)
##    treatment      subject          week0           week1           week2     
##  Min.   :1.0   Min.   : 1.00   Min.   :24.00   Min.   :23.00   Min.   :26.0  
##  1st Qu.:1.0   1st Qu.: 5.75   1st Qu.:38.00   1st Qu.:35.00   1st Qu.:32.0  
##  Median :1.5   Median :10.50   Median :46.00   Median :41.00   Median :38.0  
##  Mean   :1.5   Mean   :10.50   Mean   :48.00   Mean   :46.33   Mean   :41.7  
##  3rd Qu.:2.0   3rd Qu.:15.25   3rd Qu.:58.25   3rd Qu.:54.25   3rd Qu.:49.0  
##  Max.   :2.0   Max.   :20.00   Max.   :78.00   Max.   :95.00   Max.   :75.0  
##      week3           week4           week5           week6           week7     
##  Min.   :24.00   Min.   :20.00   Min.   :20.00   Min.   :19.00   Min.   :18.0  
##  1st Qu.:29.75   1st Qu.:28.00   1st Qu.:26.00   1st Qu.:22.75   1st Qu.:23.0  
##  Median :36.50   Median :34.50   Median :30.50   Median :28.50   Median :30.0  
##  Mean   :39.15   Mean   :36.35   Mean   :32.55   Mean   :31.23   Mean   :32.2  
##  3rd Qu.:44.50   3rd Qu.:43.00   3rd Qu.:38.00   3rd Qu.:37.00   3rd Qu.:38.0  
##  Max.   :76.00   Max.   :66.00   Max.   :64.00   Max.   :64.00   Max.   :62.0  
##      week8      
##  Min.   :20.00  
##  1st Qu.:22.75  
##  Median :28.00  
##  Mean   :31.43  
##  3rd Qu.:35.25  
##  Max.   :75.00
summary(RATS)
##        ID            Group           WD1             WD8             WD15      
##  Min.   : 1.00   Min.   :1.00   Min.   :225.0   Min.   :230.0   Min.   :230.0  
##  1st Qu.: 4.75   1st Qu.:1.00   1st Qu.:252.5   1st Qu.:255.0   1st Qu.:255.0  
##  Median : 8.50   Median :1.50   Median :340.0   Median :345.0   Median :347.5  
##  Mean   : 8.50   Mean   :1.75   Mean   :365.9   Mean   :369.1   Mean   :372.5  
##  3rd Qu.:12.25   3rd Qu.:2.25   3rd Qu.:480.0   3rd Qu.:476.2   3rd Qu.:486.2  
##  Max.   :16.00   Max.   :3.00   Max.   :555.0   Max.   :560.0   Max.   :565.0  
##       WD22            WD29            WD36            WD43      
##  Min.   :232.0   Min.   :240.0   Min.   :240.0   Min.   :243.0  
##  1st Qu.:267.2   1st Qu.:268.8   1st Qu.:267.2   1st Qu.:269.2  
##  Median :351.5   Median :356.5   Median :360.0   Median :360.0  
##  Mean   :379.2   Mean   :383.9   Mean   :387.0   Mean   :386.0  
##  3rd Qu.:492.5   3rd Qu.:497.8   3rd Qu.:504.2   3rd Qu.:501.0  
##  Max.   :580.0   Max.   :590.0   Max.   :597.0   Max.   :595.0  
##       WD44            WD50            WD57            WD64      
##  Min.   :244.0   Min.   :238.0   Min.   :247.0   Min.   :245.0  
##  1st Qu.:270.0   1st Qu.:273.8   1st Qu.:273.8   1st Qu.:278.0  
##  Median :362.0   Median :370.0   Median :373.5   Median :378.0  
##  Mean   :388.3   Mean   :394.6   Mean   :398.6   Mean   :404.1  
##  3rd Qu.:510.5   3rd Qu.:516.0   3rd Qu.:524.5   3rd Qu.:530.8  
##  Max.   :595.0   Max.   :612.0   Max.   :618.0   Max.   :628.0
str(BPRS)
## 'data.frame':    40 obs. of  11 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ week0    : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week1    : int  36 68 55 77 75 43 61 36 43 51 ...
##  $ week2    : int  36 61 41 49 72 41 47 38 39 51 ...
##  $ week3    : int  43 55 38 54 65 38 30 38 35 55 ...
##  $ week4    : int  41 43 43 56 50 36 27 31 28 53 ...
##  $ week5    : int  40 34 28 50 39 29 40 26 22 43 ...
##  $ week6    : int  38 28 29 47 32 33 30 26 20 43 ...
##  $ week7    : int  47 28 25 42 38 27 31 25 23 39 ...
##  $ week8    : int  51 28 24 46 32 25 31 24 21 32 ...
str(RATS)
## 'data.frame':    16 obs. of  13 variables:
##  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD1  : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ WD8  : int  250 230 250 255 260 265 275 255 415 420 ...
##  $ WD15 : int  255 230 250 255 255 270 260 260 425 430 ...
##  $ WD22 : int  260 232 255 265 270 275 270 268 428 440 ...
##  $ WD29 : int  262 240 262 265 270 275 273 270 438 448 ...
##  $ WD36 : int  258 240 265 268 273 277 274 265 443 460 ...
##  $ WD43 : int  266 243 267 270 274 278 276 265 442 458 ...
##  $ WD44 : int  266 244 267 272 273 278 271 267 446 464 ...
##  $ WD50 : int  265 238 264 274 276 284 282 273 456 475 ...
##  $ WD57 : int  272 247 268 273 278 279 281 274 468 484 ...
##  $ WD64 : int  278 245 269 275 280 281 284 278 478 496 ...

###The treatment weeks are in separate columns

###Convert categorical data to factors - BPRS

BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)

BPRSL <-  BPRS %>% tidyr::gather(key = weeks, value = bprs, -treatment, -subject)

BPRSL <-  BPRSL %>% dplyr::mutate(week = as.integer(substr(weeks, 5,5)))
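As a side note, the same wide-to-long conversion can be written with tidyr's newer pivot_longer() interface; this sketch should produce the same long data as the gather() call above (row order may differ):

```r
# Equivalent long-form conversion using pivot_longer()
BPRSL_alt <- BPRS %>%
  tidyr::pivot_longer(cols = -c(treatment, subject),
                      names_to = "weeks", values_to = "bprs") %>%
  dplyr::mutate(week = as.integer(substr(weeks, 5, 5)))
```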

###And the same preparation for RATS

Convert the categorical variables of RATS to factors, then convert the data to long form and extract the time variable.

RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

RATSL <- RATS %>%
  tidyr::gather(key = WD, value = Weight, -ID, -Group) %>%
  dplyr::mutate(Time = as.integer(substr(WD, 3, 4))) 

glimpse(RATSL)
## Observations: 176
## Variables: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1,…
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1",…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 47…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8,…
str(RATSL)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
names(RATSL)
## [1] "ID"     "Group"  "WD"     "Weight" "Time"

In the long form, time is in a single column, whereas in the wide form each time point has its own column.
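To convince ourselves that no information is lost in the reshaping, the long data can be spread back into wide form; a sketch using gather()'s counterpart spread():

```r
# Round-trip back to wide form: one column per WD time point again
RATS_wide <- RATSL %>%
  dplyr::select(-Time) %>%
  tidyr::spread(key = WD, value = Weight)
```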

#Access the package ggplot2
library(ggplot2)

##Exploring the RATS and RATSL datasets

glimpse(RATS)
## Observations: 16
## Variables: 13
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1   <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ WD8   <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 465…
## $ WD15  <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 475…
## $ WD22  <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 485…
## $ WD29  <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 487…
## $ WD36  <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 493…
## $ WD43  <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 493…
## $ WD44  <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 504…
## $ WD50  <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 507…
## $ WD57  <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 518…
## $ WD64  <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 525…
head(RATS)
##   ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 1  1     1 240 250  255  260  262  258  266  266  265  272  278
## 2  2     1 225 230  230  232  240  240  243  244  238  247  245
## 3  3     1 245 250  250  255  262  265  267  267  264  268  269
## 4  4     1 260 255  255  265  265  268  270  272  274  273  275
## 5  5     1 255 260  255  270  270  273  274  273  276  278  280
## 6  6     1 260 265  270  275  275  277  278  278  284  279  281
str(RATS)
## 'data.frame':    16 obs. of  13 variables:
##  $ ID   : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD1  : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ WD8  : int  250 230 250 255 260 265 275 255 415 420 ...
##  $ WD15 : int  255 230 250 255 255 270 260 260 425 430 ...
##  $ WD22 : int  260 232 255 265 270 275 270 268 428 440 ...
##  $ WD29 : int  262 240 262 265 270 275 273 270 438 448 ...
##  $ WD36 : int  258 240 265 268 273 277 274 265 443 460 ...
##  $ WD43 : int  266 243 267 270 274 278 276 265 442 458 ...
##  $ WD44 : int  266 244 267 272 273 278 271 267 446 464 ...
##  $ WD50 : int  265 238 264 274 276 284 282 273 456 475 ...
##  $ WD57 : int  272 247 268 273 278 279 281 274 468 484 ...
##  $ WD64 : int  278 245 269 275 280 281 284 278 478 496 ...
summary(RATS)
##        ID     Group      WD1             WD8             WD15      
##  1      : 1   1:8   Min.   :225.0   Min.   :230.0   Min.   :230.0  
##  2      : 1   2:4   1st Qu.:252.5   1st Qu.:255.0   1st Qu.:255.0  
##  3      : 1   3:4   Median :340.0   Median :345.0   Median :347.5  
##  4      : 1         Mean   :365.9   Mean   :369.1   Mean   :372.5  
##  5      : 1         3rd Qu.:480.0   3rd Qu.:476.2   3rd Qu.:486.2  
##  6      : 1         Max.   :555.0   Max.   :560.0   Max.   :565.0  
##  (Other):10                                                        
##       WD22            WD29            WD36            WD43      
##  Min.   :232.0   Min.   :240.0   Min.   :240.0   Min.   :243.0  
##  1st Qu.:267.2   1st Qu.:268.8   1st Qu.:267.2   1st Qu.:269.2  
##  Median :351.5   Median :356.5   Median :360.0   Median :360.0  
##  Mean   :379.2   Mean   :383.9   Mean   :387.0   Mean   :386.0  
##  3rd Qu.:492.5   3rd Qu.:497.8   3rd Qu.:504.2   3rd Qu.:501.0  
##  Max.   :580.0   Max.   :590.0   Max.   :597.0   Max.   :595.0  
##                                                                 
##       WD44            WD50            WD57            WD64      
##  Min.   :244.0   Min.   :238.0   Min.   :247.0   Min.   :245.0  
##  1st Qu.:270.0   1st Qu.:273.8   1st Qu.:273.8   1st Qu.:278.0  
##  Median :362.0   Median :370.0   Median :373.5   Median :378.0  
##  Mean   :388.3   Mean   :394.6   Mean   :398.6   Mean   :404.1  
##  3rd Qu.:510.5   3rd Qu.:516.0   3rd Qu.:524.5   3rd Qu.:530.8  
##  Max.   :595.0   Max.   :612.0   Max.   :618.0   Max.   :628.0  
## 
head(RATSL)
##   ID Group  WD Weight Time
## 1  1     1 WD1    240    1
## 2  2     1 WD1    225    1
## 3  3     1 WD1    245    1
## 4  4     1 WD1    260    1
## 5  5     1 WD1    255    1
## 6  6     1 WD1    260    1
summary(RATSL)
##        ID      Group       WD                Weight           Time      
##  1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
##  4      : 11                             Mean   :384.5   Mean   :33.55  
##  5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11                             Max.   :628.0   Max.   :64.00  
##  (Other):110

From the wide dataset we can see that there are 3 groups: Group 1 has 8 rats, and Groups 2 and 3 have 4 rats each. Weight is measured at 11 time points, roughly one week (7 WD) apart. Overall, weight seems to increase over time.

ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=8)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

Next the weights are standardized within each time point ((Weight - mean)/SD) and the same plot is drawn.

RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate( stdWeight = (Weight - mean(Weight))/sd(Weight) ) %>%
  ungroup()
glimpse(RATSL)
## Observations: 176
## Variables: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2…
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1,…
## $ WD        <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD…
## $ Weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555,…
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8,…
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001,…
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=8)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$stdWeight), max(RATSL$stdWeight)))

Drawing the lines of the individual rats within the different groups over time: the x-axis shows the time points and the y-axis the weight in grams. Individuals are drawn with different line types, and the groups are shown in separate panels.
In this individual-level graph we can clearly see that the baseline is much lower in Group 1. Group 2 has an outlier with a higher starting weight, which inflates the group mean.

After standardization the weights have mean 0 and SD 1 at each time point. Because the standardization is done by grouping by time, the time variable no longer shows the same increasing trend as before.
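The per-time-point standardization above can equivalently be written with base R's scale(); a sketch (the illustrative column stdWeight2 should match stdWeight):

```r
# scale() computes (x - mean(x)) / sd(x); as.vector() drops the matrix attributes
RATSL_chk <- RATSL %>%
  group_by(Time) %>%
  mutate(stdWeight2 = as.vector(scale(Weight))) %>%
  ungroup()
```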

##Combining the individuals into group means

RATSL2 <- RATSL
RATSL2 <- within(RATSL2, {Time <- factor(Time)})
RATSL2.aov <- aov(Weight ~ Group * Time + Error(ID), data = RATSL2)
summary(RATSL2.aov)
## 
## Error: ID
##           Df  Sum Sq Mean Sq F value   Pr(>F)    
## Group      2 2604050 1302025   88.07 2.76e-08 ***
## Residuals 13  192186   14784                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Time        10  23400  2340.0  58.838  < 2e-16 ***
## Group:Time  20   4906   245.3   6.168 2.88e-11 ***
## Residuals  130   5170    39.8                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
n <- RATSL$Time %>% unique() %>% length()
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
  ungroup()
glimpse(RATSS)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, …
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.3…
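Note that `n` above is the number of time points (11) rather than the number of rats, so the standard errors do not reflect the actual group sizes (8, 4 and 4). A sketch using dplyr::n() to count the rats within each group at each time point instead:

```r
# Standard error based on the actual number of rats per group
RATSS_alt <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight),
            se = sd(Weight) / sqrt(dplyr::n())) %>%
  ungroup()
```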


Plotting the mean profiles

ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1:3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1:3)) +
  #geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(rats) +/- se(rats)")

The between-groups test indicates that the group variable is significant; accordingly, the lines for the three groups are rather far apart in the graph. The within-subject test indicates a significant time effect; in other words, the groups show an increase in weight over time. The slopes of the lines are increasing in all groups. One group starts at a lower mean weight, and its increase over time seems smaller than in the other two groups.

From the previous graphs we could already see that Group 2 has an outlier, but let us look at this more clearly with a boxplot.

ggplot(RATSL2, aes(x = factor(Time), y = Weight, fill = Group)) + geom_boxplot(position = position_dodge(width = 0.9)) + scale_x_discrete(name = "Time") 

RATSL8S <- RATSL %>%
  filter(Time > 0) %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(Weight) ) %>%
  ungroup()

Glimpse the data

glimpse(RATSL8S)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274…

Drawing a boxplot of the mean weight versus Group

ggplot(RATSL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), grams")

From the boxplot we can clearly see the Group 2 outlier. Groups 1 and 3 also each have a single outlier, but those follow the mean trend. For science's sake, let us filter the outliers out of these groups.

RATSL8S1 <- RATSL8S %>%
  filter(mean < 550, mean > 240)
RATSL8SG3 <- subset(RATSL8S, Group==3, select=c(Group, ID, mean)) %>%
  filter(mean > 500)
RATSL8SG12 <- subset(RATSL8S1, Group %in% 1:2, select=c(Group, ID, mean))
RATSL8S1 <- rbind(RATSL8SG12, RATSL8SG3)

Drawing a boxplot without outliers of the mean versus Group

ggplot(RATSL8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), grams")

Without the outliers the group means are clearly different.

##ANOVA

Next we use ANOVA for the analysis of variance; a t-test can't be used here because there are more than two groups to compare.

fit1 <- lm(mean ~ Group, data = RATSL8S1)
summary(fit1)
## 
## Call:
## lm(formula = mean ~ Group, data = RATSL8S1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0000 -2.9394 -0.4848  3.4242  7.7727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  266.955      2.999   89.02 1.35e-10 ***
## Group2       180.864      5.194   34.82 3.74e-08 ***
## Group3       269.803      4.581   58.90 1.61e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.997 on 6 degrees of freedom
## Multiple R-squared:  0.9984, Adjusted R-squared:  0.9978 
## F-statistic:  1827 on 2 and 6 DF,  p-value: 4.408e-09
anova(fit1)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 131409   65704  1826.7 4.408e-09 ***
## Residuals  6    216      36                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
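If we also want to know which particular group pairs differ, a Tukey post-hoc test could be run on an aov() refit of the same model; a sketch:

```r
# Pairwise group comparisons with Tukey's honest significant differences
fit1_aov <- aov(mean ~ Group, data = RATSL8S1)
TukeyHSD(fit1_aov)
```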

Here we can see that the group means differ significantly.

Using the baseline weight (WD1) as a covariate in the model, we can check whether a group difference remains.

baseline <- RATS$WD1
RATSL8S2 <- RATSL8S %>%
  mutate(baseline)
fit2 <- lm(mean ~ baseline + Group, data = RATSL8S2)
summary(fit2)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL8S2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.732  -3.812   1.991   6.889  13.455 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.14886   19.88779   1.516   0.1554    
## baseline     0.93194    0.07793  11.959 5.02e-08 ***
## Group2      31.68866   17.11189   1.852   0.0888 .  
## Group3      21.52296   21.13931   1.018   0.3287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.62 on 12 degrees of freedom
## Multiple R-squared:  0.9947, Adjusted R-squared:  0.9933 
## F-statistic: 747.8 on 3 and 12 DF,  p-value: 6.636e-14
anova(fit2)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 252125  252125 2237.0655 5.217e-15 ***
## Group      2    726     363    3.2219   0.07586 .  
## Residuals 12   1352     113                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The significance was lost, so the baseline weight should be taken into account when choosing a treatment group.

##Season's greetings from a mental institute

The other dataset contains Brief Psychiatric Rating Scale (BPRS) measurements from 40 males, taken weekly for eight weeks. Subjects were randomly assigned to two treatment groups.

The wide- and long-form comparison was already done at the beginning of this exercise. All 40 subjects have 9 measurements (week 0 to week 8), so the long data set contains 360 observations.

A glimpse at the BPRSL data

glimpse(BPRSL)
## Observations: 360
## Variables: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
str(BPRS)
## 'data.frame':    40 obs. of  11 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ week0    : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week1    : int  36 68 55 77 75 43 61 36 43 51 ...
##  $ week2    : int  36 61 41 49 72 41 47 38 39 51 ...
##  $ week3    : int  43 55 38 54 65 38 30 38 35 55 ...
##  $ week4    : int  41 43 43 56 50 36 27 31 28 53 ...
##  $ week5    : int  40 34 28 50 39 29 40 26 22 43 ...
##  $ week6    : int  38 28 29 47 32 33 30 26 20 43 ...
##  $ week7    : int  47 28 25 42 38 27 31 25 23 39 ...
##  $ week8    : int  51 28 24 46 32 25 31 24 21 32 ...

Plotting groups, visualization

ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

There seems to be a declining trend in both groups, but there is quite a lot of noise. We need a way to build a model and analyse whether the treatments differ, using the BPRS score as the response variable.

###First a linear model

fitbprs <- lm(bprs ~ treatment, data = BPRSL)
summary(fitbprs)
## 
## Call:
## lm(formula = bprs ~ treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.944 -10.372  -2.372   5.628  57.056 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.3722     1.0193  36.663   <2e-16 ***
## treatment2    0.5722     1.4416   0.397    0.692    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.68 on 358 degrees of freedom
## Multiple R-squared:  0.0004399,  Adjusted R-squared:  -0.002352 
## F-statistic: 0.1576 on 1 and 358 DF,  p-value: 0.6916

This model assumes independence of the repeated measures, which does not hold for repeated measures on the same individuals. We therefore need an appropriate model such as the random intercept model, which uses two explanatory variables (week and treatment) and allows the linear fit of each individual to differ in intercept from the others. The BPRS scores don't seem to depend on the treatment.

The Random Intercept Model

library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

 

The result is of the same type as before: the treatment effect remains non-significant.
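Since lmer() does not report p-values for the fixed effects, approximate 95% confidence intervals offer one way to judge them; a sketch using Wald intervals:

```r
# Wald confidence intervals for the random intercept model's fixed effects;
# an interval crossing zero suggests a non-significant effect
confint(BPRS_ref, method = "Wald")
```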

Random intercept and random slope model

BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref   5 2748.7 2768.1 -1369.4   2738.7                           
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The random slope model itself doesn't bring much new information, but according to the ANOVA comparison it fits the data better than the random intercept model (p = 0.026).

Random Intercept and Random Slope Model with interaction

BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
anova(BPRS_ref1, BPRS_ref2)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4                           
## BPRS_ref2  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, group=interaction(treatment, subject))) +
  scale_x_continuous(name = "Week", breaks = seq(0, 60, 20)) +
  scale_y_continuous(name = "bprs")
p1 + geom_line()

Fitted <- fitted(BPRS_ref2)
BPRSL <- BPRSL %>%
  mutate(Fitted)
p2 <- ggplot(BPRSL, aes(x = week, y = Fitted, group=interaction(treatment, subject))) +
  scale_x_continuous(name = "Week", breaks = seq(0, 60, 20)) +
  scale_y_continuous(name = "bprs")
p2 + geom_line()

The random intercept and random slope model with interaction seems to describe the observations well.
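One simple way to quantify how closely the fitted values track the observations is the root mean squared error; a sketch:

```r
# RMSE between observed BPRS scores and the interaction model's fitted values
sqrt(mean((BPRSL$bprs - fitted(BPRS_ref2))^2))
```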